library(tidyverse)
# Genomics helpers (focus today)
library(ASRgenomics)Genomic Data Exploration
Previously: Spatial, Multi-trait and Multi-stage Models
Lesson plan
- Download example SNP datasets from Canvas and get to know them
- Read a SNP dataset into R in a form suitable for genomic analyses.
- Diagnose common issues: missingness, minor allele frequencies, heterozygosity.
- Visualize population structure using PCA.
Intro
File Formats
As we will discuss in lecture, there are many/several data formats the genomic data are stored in.
It is valuable to familiarize yourself with them.
FASTA: Raw sequence reads, aligned to a reference genome sequence
A sequence begins with a greater-than character (“>”) followed by a description of the sequence (all in a single line). The lines immediately following the description line are the sequence representation, with one letter per amino acid or nucleic acid, and are typically no more than 80 characters in length. ~Wikipedia
Example fasta file. Nucleotide sequences from a single Illumina-type short-read. Usually ~150bp long.
VCF: Variance Call Files. After aligning raw reads to a genome and after quality control steps, variant discovery can be conducted. Variant discovery is the process of determining which genomic loci contain naturally varying sequences. VCF files are simply text files, no special formatting. However, b/c the data files get quite large, you’ll often find them in a compressed form (ending usually in
.vcf.gz).We are concerned primarily (in this class) with the most common type of variant– the single nucleotide polymorphism (SNP). SNP are usually (not always) bi-allelic in a given population; this means that only two different versions (alleles) are found. E.g. A vs. T or G vs. C.
Official specification here: https://samtools.github.io/hts-specs/VCFv4.2.pdf
An example depicted the major components of a VCF file.
Hapmap: often, in crop science, our genotyping data are processed into a simpler-than-vcf form. The HMP file is a common one.
An example of the HMP (hapmap) format, from the GAPIT manual. The first 11 columns display attributes of the SNPs and the remaining columns show the nucleotides observed at each SNP for each taxa. The first row contains the header labels and each remaining row contains all the information for a single SNP. ~GAPIT manual
“plink” format:
plinkis a really useful program for manipulating and conducting some population genomics analyses. The originalplinkformat involves two files a.pedand a.map. In the last 10 years, (plink1.9–> nowplink2) have supplanted that with a binary version.Overview of various commonly used PLINK files. SNP = single nucleotide polymorphism - Thanks in part to the binary data format,
plink1.9is lightening fast, even on laptops and for large genomic data. Each of the software formats and file types described above have their uses.
Command Line (No Excel)
To work with these data files, you will not be able to use EXCEL or R (maybe R actually, but not usually for the initial steps or the biggest data).
Working with these data requires command line interface (CLI) and usually a Linux or Unix shell are used.
NOTE: I am working on getting our class access the Auburn High Performance Computing (HPC) cluster. In a future class, we can learn to work with these data types. Doing so will require a unified computing environment.
Installing an compiling these programs on everyone’s laptops will be a challenge because of different hardware and software configurations.
Working at the Command Line: While Mac and Linux are built on a common sub-system (called Unix), Window’s OS is not.
- Windows: To access a Unix-like command line environment, you can install the “Windows Subsystem for Linux” (WSL): Link to Install Page.
- If your working a Mac, or are into Linux (🐧) you can simply open the “Terminal” application.
Learning Command Line:
- Software Carpentry - “The Unix Shell” - free tutorial / lessons on the basics
- Datacamp - “Introduction to Shell” - free tutorial / lessons on the basics
- Others
- https://www.codecademy.com/learn/learn-the-command-line/modules/learn-the-command-line-navigation/cheatsheet
- Secure Copy Protocol (SCP) - Transfer Files using SSH & Command Line on Windows 10 to Linux / Other
- SSH Client on Windows 10 Using the Command Prompt | SSH from Windows to Linux and Other Systems
Common Command Line Programs for Genomic Variant Datasets:
There are so many options and its evolving quick. It depends on what you want / need to do.
However, here are a few of the core software I recommend:
vcftoolsbcftools- plink1.9(somehow, bewildering called a “beta” version) or
plink2(“alpha”). TASSELand nowrTASSEL
Isik, Holland, and Maltecca (2017) Chapter 9: Exploratory Marker Analysis
SNP data in R
Our main hands-on goal today will be to see how SNP data can be represented in R. We will learn a few initial tricks about manipulating and visualizing those data.
Data on Canvas
Download the three datasets (described below) from the “Data” module on Canvas. Place them in the data/ sub-directory of your R project / repository.
I do not use all of them in my demo. They are for you to explore!
- White Lupins (Lupinus albus):
WhiteLupins_KHU164_SNPcalls.Ihapmap: An imputed HapMap (~340MB) on a collection of >400 genotypes at ~250K SNPs. These data were generated by low-pass Illumina short-read sequencing of each sample by HudsonAlpha’s Khufu team. As part of the genotyping service, missing data are imputed by a proprietary Khufu algorithm.wl_metadata.csv: This file provides some meta information that will be useful for exploring the data. Genotypes are labelled according to:- Population-of-origin (one developed at AU and another representing global diversity from the NPGS [National Plant Germplasm System]).
- Plant Type (
PlantArch): Lupins vary in their plant architecture. Some have DETERMINATE flowering from only the apical branch, others are INDETERMINATE.
- Oats (Avena sativa):
SunOat2022-3K_01-05.AB.txt: was generated by USDA-ARS using a 3K SNP chip; file resembles a HapMap file and contains genotypes for 480 oat accessions. Some are named, others are experimental lines. These data are attributed to Drs. Ali Babar (UF) and Steve Harrison (LSU) as part of the SunGrains Breeding Cooperative.
- Cassava (Manihot esculenta):
cassava_workshop_data.zip: You’ll want to download and unzip this.- Multiple file types are provided. These data are all publicly available and downloaded from www.cassavabase.org. The entire process is documented in a 2022 Workshop on Genomic Selection.
- The data were generated using a combination of genotyping-by-sequencing and a similar reduced-representation marker platform called “DArTseqLD”. They have been pre-imputed.
Install ASRgenomics
Since we’ve been using asreml I thought it useful to include another R package by the same team (VSNi) - ASRgenomics.
To install, go this VSNi page, and click the “Download” button. Follow the provided instructions. You’ll need to fill the web-form to get the download. Install via Rstudio.
Read oat data
First off, we’ll read the smallest dataset (the Oat data) into R.
# Be sure to match the file path to your LOCAL machine repository
oats<-read_delim(here::here("data/SunOat_3KSNP_data","SunOat2022-3K_01-05.AB.txt"))
oats %>% dim[1] 2989 483
oats[1:5,1:10]# A tibble: 5 × 10
`#ID` CHROM POS BIG_MAC BOB BROOKS Caballo Cantara COKER_227 COKER_234
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 avgbs_1… chr4A 4.58e8 NULL AA BB AA BB NC NULL
2 avgbs_1… chr1C 4.57e8 BB BB BB BB BB BB BB
3 avgbs_1… chr4A 4.13e8 AA AA AA AA AA AA AA
4 avgbs_1… chr6C 1.41e8 BB AA AA AA AA AA AA
5 avgbs_1… chr3D 4.68e8 BB BB AA BB AA BB BB
This dataset has markers on the rows and genotypes of accessions on the columns.
Convert to “dosage”
ASRgenomics expects a genotype matrix M with individuals in rows, markers in columns, coded 0/1/2, with rownames = individual IDs and colnames = marker IDs. See qc.filtering(), snp.pca(), G.matrix() docs.
Example format:
| id | snp1 | snp2 | snp3 | … |
|---|---|---|---|---|
| G001 | 0 | 1 | NA | … |
| G002 | 2 | 0 | 1 | … |
This format is known as “allele dosage”. The numbers represent a digitization of the genotype values (HOM REF, HET, HOM ALT) that is most useful for statistical genetics. “Allele dosage” assumes only two alleles are present and counts the number of “doses” an individual posesses of one of the two alleles; typically the “alternative” allele is counted. “Alternative” refers to the allele that IS NOT represented in the reference genome sequence. Sometimes, the major or minor allele are counted instead.
Point is, something like 0 = AA, 1 = AT, 2 = TT is the encoding we want.
ASRgenomics provides a helper function snp.recode() that can convert a HapMap (ATGC) format to dosage.
This works on the White Lupins dataset.
However, it doesn’t work on the oat dataset I chose as my example. 🤦
Below, I work out a pipeline that manually recodes the data. I recommend you reverse engineer it and know that different data sources pose different manipulation problems.
NOTE: If you’ve got 1000’s of individuals and millions of SNP, R might not be the right place to start.
oats_recoded<-oats %>%
# Sort by chrom AND pos
arrange(CHROM,POS) %>%
# Need an SNP_ID that can be put in the "rownames" and keeps track of chrom+pos+alleles
mutate(SNP_ID=paste0(CHROM,"_",POS,":",`#ID`)) %>%
# Noticed some markers (rows) were not associated with a chromosome
## or had different chrom naming than the rest
## to be careful, I'll remove them
filter(!CHROM %in% c("UN","ChrUn","1A","2C","4D","5C","6A")) %>%
# Move the SNP_ID column into the rownames position
column_to_rownames(var = "SNP_ID") %>%
# Remove meta data columns
select(-`#ID`,-CHROM,-POS)
# Use mutate() on all columns
# Find AA turn it into a 0
# AB = 1
# BB = 2
# Should get NA otherwise
oats_dosage<-oats_recoded %>%
mutate(across(.cols=everything(),
~ ifelse(.x=="AA","0",
ifelse(.x=="AB","1",
ifelse(.x=="BB","2",.))))) %>%
mutate(across(.cols=everything(),~as.numeric(.)))
oats_dosage<-t(oats_dosage) %>%
as.matrix()
oats_dosage[1:5,1:5] chr1A_82978:GMI_ES02_c19502_234 chr1A_4617566:GMI_ES17_c4089_545
BIG_MAC NA 0
BOB NA 2
BROOKS NA 0
Caballo NA 2
Cantara NA 0
chr1A_6688611:avgbs_115256 chr1A_10592626:avgbs2_19724
BIG_MAC 1 2
BOB 2 2
BROOKS 1 2
Caballo 2 2
Cantara 1 2
chr1A_20166187:GMI_ES17_c2656_146
BIG_MAC 1
BOB 2
BROOKS 1
Caballo 2
Cantara 1
table(oats_dosage,useNA = 'ifany')oats_dosage
0 1 2 <NA>
590754 81033 700431 56742
Summary stats
Here are some handy calculations you can now do without any package:
# Proportion missing overall
mean(is.na(oats_dosage))[1] 0.0397086
# The allele frequency (of the counted alleles)
af<-colMeans(oats_dosage, na.rm = T)/2
summary(af) Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0000 0.2185 0.5743 0.5390 0.8725 1.0000 51
Usually, in statistical genetics applications, we want to know the minor allele frequency (MAF).
Why? Statistical reasons. Most datasets have lots of rare alleles. If genotypes at a locus are your predictors in a downstream analysis, you’ll not have good power to determine the phenotype-genotype effect for very rare alleles. Imagine you do an experiment to test for a fertilizer effect. You’ve got 100 plots. A typical design will have 50 plots WITH vs. 50 plots WITHOUT fertilizer. Why? Sampling error and statistical power. If you observe 2 plots WITH fert. and 95 WITHOUT fert. Do you think you have a good estimate? Recall the sensitivity of fixed-effects to imbalanced data. So to is the case with marker genotypes. Stay tuned for more coverage of this concept / problem.
Rare alleles are informative, but unstable for some analyses; hence filtering is needed.This is why QC choices matter downstream.
# Here is a simple function to compute the minor allele frequencies
getMAF<-function (M){
freq <- colMeans(M, na.rm = T)/2
maf <- freq
maf[which(maf > 0.5)] <- 1 - maf[which(maf > 0.5)]
return(maf)
}
maf<-getMAF(oats_dosage)
summary(maf) Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00000 0.05021 0.16614 0.19162 0.31746 0.50000 51
hist(maf)QC filtering
ASRgenomics provides qc.filtering() to do basic QC filtering and simple imputation.
“Imputation is the process of replacing missing data with substituted values.” ~Wikipedia
Imputation is often needed for downstream applications that are only available for ‘complete data’.
We’ll use “mean-imputation” here. This means simply putting the sample mean in place of any missing data value.
In genomics, there are MUCH better ways to impute. Here’s a link to the program Beagle (just an example not an endorsement, Browning, Zhou, and Browning (2018)). Beagle uses what is called a “hidden Markov model” to probabilistically infer missing values; an extremely powerful computational approach for genomics, given genome structure.
# qc.filtering expects 0/1/2 with NAs allowed.
# Key knobs: call rate, MAF, and imputation approach.
qc <- qc.filtering(M=oats_dosage,
maf = 0.01, # keep markers with MAF >= 0.01
marker.call = 0.95, # keep SNPs called in >= 95% of indivs
ind.call = 0.80, # keep individuals with >= 80% SNPs called
imput = TRUE) # simple mean imputation# What did we get back?
names(qc)[1] "M.clean" "map" "plot.missing.ind" "plot.missing.SNP"
[5] "plot.heteroz" "plot.Fis" "plot.maf"
oats_dosage_cleaned<-qc$M.clean
dim(oats_dosage_cleaned)[1] 480 2521
mean(is.na(oats_dosage_cleaned))[1] 0
Several pre-made plots are provided. Pretty convenient.
Oats are self-pollinating small grains. Oat breeders typically (not always) wait until new lines have been inbred to the level of (>99% homozygosity) before genotyping.
For that reason, a useful sanity check in this case is the rate of heterozygosity.
# According to the manual,
# this is a plot of observed heterozygosity per SNP (original marker matrix).
qc$plot.heterozIn fact, Missingness by individual may be much more relevant in this case.
Here’s a function to do missing-by-individual:
getPropHom<-function (M) {
W <- M
W[which(W == 2)] <- 0
f <- 1 - (rowSums(W)/ncol(W))
return(f)
}prop_hom<-getPropHom(oats_dosage_cleaned)
summary(prop_hom) Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2483 0.9101 0.9149 0.9092 0.9179 0.9288
hist(prop_hom)So, this sanity checks us. But there ARE some pretty heterozygous individuals.
prop_hom %>% .[which(.<0.8)] NC14-1787 LA17089SBSS-73-1 TX20CAS1079 NC19-3542
0.7936335 0.7751210 0.2483380 0.7395914
What do you think could explain this?
BTW, I’ve encapsulated several of these useful functions into my own R package, genomicMateSelectR.
If you’re interested, find it here: https://wolfemd.github.io/genomicMateSelectR/
You can install genomicMateSelectR package from my GitHub with:
devtools::install_github("wolfemd/genomicMateSelectR", ref = 'master')PCA to visualize structure
ASRgenomics offers snp.pca() for PCA on the marker matrix. But you can also use the core R function prcomp().
Principal Components Analysis (PCA): linear dimensionality reduction technique often used in exploratory data analysis, visualization and data preprocessing.
Learn PCA:
- Data Camp Tutorial on PCA in R
- StatQuest with Josh Starmer - PCA on YouTube
- Reich, Price, and Patterson (2008) - An early paper reviewing the use of PCA with genetic data
- Price et al. (2006) - Jumping the gun a bit, but PCA’s use to control for population genetic structure in GWAS is first described in this paper.
pca_on_snps <- snp.pca(M = oats_dosage_cleaned,ncp = 15)
# 10 PCs for exploration
names(pca_on_snps)[1] "pca.scores" "eigenvalues" "plot.scree" "plot.pca"
# PCA Scores
pca_on_snps$pca.scores %>%head PC1 PC2 PC3 PC4 PC5 PC6
BIG_MAC -0.8115166 0.5312307 -0.2618582 0.3511487 -0.23905513 -0.25916915
BOB -0.0490075 1.0561610 -0.6368674 -0.1735671 -0.14564473 -0.35001333
BROOKS -0.3640837 -0.2718282 -0.5761865 -0.4152877 -0.04842491 0.12150877
Caballo -1.0215414 1.0254975 -0.5482151 -0.1670740 -0.09641643 -0.08068248
Cantara -0.3632970 -0.2740405 -0.5834303 -0.4173835 -0.04428983 0.13075484
COKER_227 -1.2789803 0.9881427 -0.1254722 0.1159675 0.27366714 -0.20148855
PC7 PC8 PC9 PC10 PC11 PC12
BIG_MAC 0.2363907 -0.0006181521 0.1296274 -0.4084578 0.0569676 -0.20705046
BOB 0.5891244 -0.0929030801 1.1146056 -0.4584778 0.1281358 -0.00508864
BROOKS 0.1101220 -0.0862704449 -0.2734634 0.1983333 -0.1086180 0.31441081
Caballo -0.1622409 -0.0660100764 0.2196498 0.4090137 0.1415139 0.33121767
Cantara 0.1054150 -0.0834733323 -0.2731552 0.1939836 -0.1104019 0.31561090
COKER_227 -0.2493228 0.1131566988 0.3039956 -0.2764918 -0.1882310 0.10051105
PC13 PC14 PC15
BIG_MAC 0.26894315 -0.07602738 0.3262893
BOB -0.03380547 -0.13040304 0.1573659
BROOKS 0.12182094 0.21661256 0.4839442
Caballo 0.34046723 -0.25478573 0.4246931
Cantara 0.12998679 0.21764663 0.4886590
COKER_227 0.01373906 -0.12966889 0.2493736
PCA scores position each individual (oat in this case) in the principal components ‘space’. These PCA scores are also called eigenvector coeffients. Each PC is also called an eigenvector and has an associated eigenvalue which corresponds to the amount of multi-variance variability that is explained by that vector.
pca_on_snps$eigenvalues %>% head eigenvalue variance.percent cumulative.variance.percent
Dim.1 1.3526787 28.220223 28.22022
Dim.2 0.5804136 12.108862 40.32908
Dim.3 0.4634914 9.669576 49.99866
Dim.4 0.2254203 4.702825 54.70149
Dim.5 0.1822079 3.801308 58.50279
Dim.6 0.1483505 3.094959 61.59775
We can use their pre-made plot
pca_on_snps$plot.screeA “scree plot” displays the percent variance explained by each principal component (eigenvector) in order. As you can see, by definition the first one always explains the most.
pca_on_snps$plot.pcascores <- pca_on_snps$pca.scores
scores <- scores %>%
as.data.frame %>%
mutate(GID = rownames(scores))
ggplot(scores, aes(x = PC3, y = PC4)) +
geom_point(size = 2, alpha = 0.8) +
labs(title = "PCA of SNP matrix", x = "PC3 (9.7%)", y = "PC4 (4.7%)") +
theme_bw()On interpretation: Each data point is an individual oat genotype. The position of two datapoints in the plot is related to the genetic similarity between the two samples. Without additional information, this information is not immediately translatable into pedigree information, meaning you can’t tell whether two samples are siblings, parent-offspring, or any other relatedness level. Often, structures like families will appear as clusters of dots.
snp.pca has options to color code and to draw ellipses around groups of related individuals.
Hint for the Hands-on: The White Lupin dataset comes with a metadata file that could be used to group those samples.
Hint for visualizing the oats: Sample names often start with a state-of-origin prefix (e.g. “FL0047-J9” is a line from Florida). I wonder if we can identify clusters-by-origin? You’ll have to use some text processing (grep, grepl) to extract state codes from the names.
rownames(oats_dosage_cleaned)[1:25] [1] "BIG_MAC" "BOB" "BROOKS"
[4] "Caballo" "Cantara" "COKER_227"
[7] "COKER_234" "Coker_716" "Coronado"
[10] "FL0047-J9" "FL0115-J2" "FL03001BSB-S7"
[13] "FL03146BSB-S1-B-S1" "FL03254-L1" "FL04155-S06-31-B-S1"
[16] "FL0720-R6" "FL0772-R3" "FL0914-U2"
[19] "FL0923-U2" "FL0923-U4" "FL0941-U1"
[22] "FL0943-U2" "FL0943-U4" "FL1013-1"
[25] "FL1013-2"
Marker distribution
One last thing, to get us primed for mapping quantitative trait loci (QTL) along the genome. Let’s make a plot of the marker distribution across the physical positions of the genome. Are they evenly distributed, or are there signifcant gaps? If designed properly, a genotyping strategy should evenly sample the whole genome.
snp_map<-tibble(SNP_ID=colnames(oats_dosage_cleaned)) %>%
separate(SNP_ID,c("ChrPos","SNP_ID"),":") %>%
separate(ChrPos,c("Chr","Pos"),"_") %>%
mutate(Pos=as.numeric(Pos))
snp_map %>%
ggplot(., aes(x = Pos, y = Chr)) +
geom_point(size = 0.6, alpha = 0.7) +
labs(title = "Physical distribution of SNP markers",
x = "Genomic position (bp)",
y = "Chromosome") +
theme_bw()Hands-on
Now it’s your turn!
- Explore the available SNP datasets. Use the examples above, but check out the manual, see what you can do with what you’ve got. Try to get to go beyond the PCA we demonstrated above. Identify sub-groups in the data (hints given above) and color-code and see if you can distinguish the groups in genetic space using the PCA results.
- Work on learning to interact with the shell (command line). If you need to, install WSL or find another way to get up and running with some of the software mentioned above (e.g.
plink).